Detecting Bulk Document Leakage
Authors
Abstract
Enterprises and corporations have recently reported several incidents in which sensitive data was leaked or copyrighted information was plagiarized on the Web. We address the Bulk Document Leakage (BDL) problem: given a large set of sensitive or copyrighted documents in an enterprise, determine whether a portion of the document set has leaked (or has been plagiarized) and been published on the Web. An adversary who wishes to evade detection may partially tamper with the content before publishing it. We present an automated, tamper-proof, low-complexity algorithm to solve the BDL problem. We extract embedded signatures from sensitive documents and use them in conjunction with search engines to determine whether near-duplicate versions of a document (or portions of it) are available on the Web. The embedded signature is tamper-proof: even if an adversary partially modifies a document, our mechanism can detect duplicate copies. Moreover, if a duplicate copy is present on the Web, our system can detect it with a small number of queries to a search engine. We have tested the validity and tamper-proof properties of our algorithms over a wide range of documents and corpora gathered from different large enterprises. Based on the access logs of a large enterprise, we show real-world evidence of bulk leakage across several other domains.
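The abstract does not say how the signatures are built, so the following is only a minimal sketch of the general idea, under assumptions that are not from the paper: a signature is taken to be a small group of the document's rarest terms (by document frequency within the enterprise corpus), several signatures are extracted per document so that partial tampering leaves some intact, and each signature becomes one quoted-term search query. All function names and parameters (`extract_signatures`, `to_query`, `looks_leaked`, `min_matches`) are hypothetical, and no real search engine is contacted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(corpus):
    """Count, for each term, how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def extract_signatures(doc, df, num_signatures=3, terms_per_signature=2):
    """Assumed scheme: group the document's rarest terms into signatures.

    Terms are ranked rarest first; ties are broken alphabetically so the
    result is deterministic.
    """
    terms = sorted(set(tokenize(doc)), key=lambda t: (df.get(t, 0), t))
    signatures = []
    for i in range(num_signatures):
        chunk = terms[i * terms_per_signature:(i + 1) * terms_per_signature]
        if len(chunk) == terms_per_signature:
            signatures.append(tuple(chunk))
    return signatures


def to_query(signature):
    """Render one signature as a search query of individually quoted terms."""
    return " ".join(f'"{term}"' for term in signature)


def looks_leaked(candidate_text, signatures, min_matches=2):
    """Flag a candidate page if enough signatures survive in its text."""
    words = set(tokenize(candidate_text))
    hits = sum(1 for sig in signatures if all(t in words for t in sig))
    return hits >= min_matches


corpus = [
    "quarterly revenue forecast for project zephyr includes turbine margins",
    "quarterly revenue report for the retail division",
    "project status report for the retail division",
]
df = document_frequencies(corpus)
signatures = extract_signatures(corpus[0], df)
queries = [to_query(sig) for sig in signatures]

# A partially rewritten copy still matches two of the three signatures.
tampered = "revenue outlook for project zephyr with turbine margins"
print(queries)
print(looks_leaked(tampered, signatures))
```

In this sketch, tolerance to tampering comes only from redundancy: the tampered copy drops the terms of one signature, but the other two still match. The number of queries per document is bounded by `num_signatures`, which loosely mirrors the abstract's point about needing few search-engine queries.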
Similar resources
Content-based data leakage detection using extended fingerprinting
Protecting sensitive information from unauthorized disclosure is a major concern of every organization. As an organization’s employees need to access such information in order to carry out their daily work, data leakage detection is both an essential and challenging task. Whether caused by malicious intent or an inadvertent mistake, data loss can result in significant damage to the organization...
CoBAn: A context based model for data leakage prevention
A new context-based model (CoBAn) for accidental and intentional data leakage prevention (DLP) is proposed. Existing methods attempt to prevent data leakage by either looking for specific keywords and phrases or by using various statistical methods. Keyword-based methods are not sufficiently accurate since they ignore the context of the keyword, while statistical methods ignore the content of t...
Simple Procedures for Detecting Network Attachment in IPv6
This document is subject to BCP 78 and the IETF Trust’s Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text...
Document Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Impact of quantum mechanical tunneling on off-leakage current in double-gate MOSFET using a quantum drift-diffusion model
With the growing use of wireless electronic systems, off-state leakage current in MOSFETs appears as one of the major physical limitations. Measurements of quantum tunnel current between source-drain (S-D) have recently shown that it will become detrimental in bulk MOSFET architecture for channel lengths around 5nm and at low temperature (≤100K) [1]. In this paper we investigate, using a 2D qua...
Publication date: 2008